Design and Implementation of the Slovenian Phonetic and Morphology Lexicons for the Use in Spoken Language Applications
Abstract
Phonetic and morphology lexicons for spoken language applications are costly and time-consuming to build. This paper reports on a project for the semi-automatic development of large phonetic (SIflex) and morphology (SImlex) lexicons for the Slovenian language. The main goal of the project is to build phonetic and morphology lexicons for Slovenian that can be used in a range of applications in speech processing (e.g. speech synthesis and recognition) and natural language processing (e.g. spell checking), and for studying and assessing automatic grapheme-to-phoneme transcription. In automatic speech recognition, one of the major problems is the extremely high variability of pronunciation. One part of this variability can be captured by training the acoustic-phonetic units on a large amount of data. Another part must be modeled in the lexicon as pronunciation variants. In text-to-speech systems, it is also very useful to be able to detect homographs and choose the correct pronunciation from the context. These needs motivated the development of both lexicons for Slovenian. The phonetic lexicon (SIflex) currently contains more than 130,000 items, whereas the morphology lexicon (SImlex) consists of approximately 600,000 inflected forms, including information on orthography, pronunciation, stress and morphosyntactic features, as defined in the framework of the Multext project.